(54) Method and device for voice activity detection and a communication device



(57) The invention concerns a voice activity detection device in which an input speech signal (x(n)) is divided into subsignals (S(s)) representing specific frequency bands and noise (N(s)) is estimated in the subsignals. On the basis of the estimated noise in the subsignals, subdecision signals (SNR(s)) are generated and a voice activity decision (V_ind) for the input speech signal is formed on the basis of the subdecision signals. Spectrum components of the input speech signal and a noise estimate are calculated and compared. More specifically, a signal-to-noise ratio is calculated for each subsignal and each signal-to-noise ratio represents a subdecision signal (SNR(s)). From the signal-to-noise ratios a value proportional to their sum is calculated and compared with a threshold value, and a voice activity decision signal (V_ind) for the input speech signal is formed on the basis of the comparison.
Description

This invention relates to a voice activity detection device comprising means for detecting voice activity in an input signal and for making a voice activity decision on the basis of the detection. The invention likewise relates to a method for detecting voice activity and to a communication device including voice activity detection means.

A Voice Activity Detector (VAD) determines whether an input signal contains speech or background noise. A typical application for a VAD is in wireless communication systems, in which voice activity detection can be used for controlling a discontinuous transmission system, where transmission is inhibited when speech is not detected. A VAD can also be used in e.g. echo cancellation and noise cancellation.

Various methods for voice activity detection are known in the prior art. The main problem is to reliably distinguish speech from background noise in noisy environments. Patent publication US 5,459,814 presents a method for voice activity detection in which an average signal level and zero crossings are calculated for the speech signal. The solution achieves a method which is computationally simple, but which has the drawback that the detection result is not very reliable. Patent publications WO 95/08170 and US 5,276,765 present a voice activity detection method in which a spectral difference between the speech signal and a noise estimate is calculated using LPC (Linear Prediction Coding) parameters. These publications also present an auxiliary VAD detector which controls updating of the noise estimate. The VAD methods of all the above-mentioned publications have problems in reliably detecting speech when speech power is low compared to noise power.

The present invention concerns a voice activity detection device in which an input speech signal is divided into subsignals representing specific frequency bands and voice activity is detected in the subsignals. On the basis of the detection in the subsignals, subdecision signals are generated and a voice activity decision for the input speech signal is formed on the basis of the subdecision signals. In the invention, spectrum components of the input speech signal and a noise estimate are calculated and compared. More specifically, a signal-to-noise ratio is calculated for each subsignal and each signal-to-noise ratio represents a subdecision signal. From the signal-to-noise ratios a value proportional to their sum is calculated and compared with a threshold value, and a voice activity decision signal for the input speech signal is formed on the basis of the comparison.

To obtain the signal-to-noise ratio for each subsignal, a noise estimate is calculated for each subfrequency band (i.e. for each subsignal). This means that noise can be estimated more accurately and the noise estimate can also be updated separately for each subfrequency band. A more accurate noise estimate leads to a more accurate and reliable voice activity detection decision. Noise estimate accuracy is also improved by using the speech/noise decision of the voice activity detection device to control the updating of the background noise estimate.

A voice activity detection device and a communication device according to the invention are characterized in that they comprise means for dividing said input signal into subsignals representing specific frequency bands, means for estimating noise in the subsignals, means for calculating subdecision signals on the basis of the noise in the subsignals, and means for making a voice activity decision for the input signal on the basis of the subdecision signals.

A method according to the invention is characterized in that it comprises the steps of dividing said input signal into subsignals representing specific frequency bands, estimating noise in the subsignals, calculating subdecision signals on the basis of the noise in the subsignals, and making a voice activity decision for the input signal on the basis of the subdecision signals.

In the following, the invention is illustrated in more detail with reference to the enclosed figures.
Figure 1 shows briefly the environment in which the voice activity detection device 4 according to the invention is used. The parameter values presented in the following description are exemplary values and describe one embodiment of the invention, but they do not by any means limit the function of the method according to the invention to only certain parameter values. Referring to figure 1, a signal coming from a microphone 1 is sampled in an A/D converter 2. As exemplary values it is assumed that the sample rate of the A/D converter 2 is 8000 Hz, the frame length of the speech codec 3 is 80 samples, and each speech frame comprises 10 ms of speech. The VAD device 4 can use the same input frame length as the speech codec 3, or the length can be an even quotient of the frame length used by the speech codec. The coded speech signal is fed further in a transmission branch, e.g. to a discontinuous transmission handler 5, which controls transmission according to a decision received from the VAD 4.

One embodiment of the voice activity detection device according to the invention is described in more detail in figure 2. A speech signal coming from the microphone 1 is sampled in an A/D converter 2 into a digital signal x(n). An input frame for the VAD device in Fig. 2 is formed by taking samples from the digital signal x(n). This frame is fed into block 6, in which power spectrum components representing power in predefined bands are calculated. Components proportional to the amplitude or power spectrum of the input frame can be calculated using an FFT, a filter bank, or linear predictor coefficients. This will be explained in more detail later. If the VAD operates with a speech codec that calculates linear prediction coefficients, then those coefficients can be received from the speech codec.

Power spectrum components P(f) are calculated from the input frame using first the Fast Fourier Transform (FFT), as presented in figure 3. In the example solution it is assumed that the length of the FFT calculation is 128. Additionally, the power spectrum components P(f) are recombined into calculation spectrum components S(s), reducing the number of spectrum components from 65 to 8.

Referring to Fig. 3, a speech frame is brought to windowing block 10, in which it is multiplied by a predetermined window. The purpose of windowing is in general to enhance the quality of the spectral estimate of a signal and to divide the signal into frames in the time domain. Because the windows used in this example partly overlap, the overlapping samples are stored in a memory (block 15) for the next frame. 80 samples are taken from the signal and they are combined with 16 samples stored during the previous frame, resulting in a total of 96 samples. Correspondingly, out of the last collected 80 samples, the last 16 samples are stored for use in calculating the next frame.

The 96 samples obtained this way are multiplied in windowing block 10 by a window comprising 96 sample values, the first 8 values of the window forming the ascending strip I_U of the window, and the last 8 values forming the descending strip I_D of the window, as presented in figure 7. The window I(n) can be defined as follows and is realized in block 11 (figure 6):

$$
I(n) =
\begin{cases}
(n+1)/9 = I_U, & n = 0, \ldots, 7 \\
1 = I_M, & n = 8, \ldots, 87 \\
(96-n)/9 = I_D, & n = 88, \ldots, 95
\end{cases}
\tag{1}
$$

Realizing windowing (block 11) digitally is known to a person skilled in the art of digital signal processing. It should be noted that in the window the middle 80 values (n = 8,...,87, or the middle strip I_M) are equal to 1, and accordingly multiplication by them does not change the result and the multiplication can be omitted. Thus only the first 8 samples and the last 8 samples in the window need to be multiplied. Because the length of an FFT has to be a power of two, in block 12 (figure 6) 32 zeroes (0) are added at the end of the 96 samples obtained from block 11, resulting in a speech frame comprising 128 samples. Adding samples at the end of a sequence of samples is a simple operation and the realization of block 12 digitally is within the skills of a person skilled in the art.
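The framing, windowing and zero-padding steps described above can be sketched as follows. This is only an illustrative sketch using the exemplary values of the text (80 new samples, 16 carried samples, 8-value ramps, 128-point FFT); the function names `make_window` and `window_frame` are hypothetical and do not come from the specification.

```python
FRAME_LEN = 80   # new samples per frame (10 ms at 8 kHz)
OVERLAP = 16     # samples carried over from the previous frame
RAMP = 8         # length of the ascending and descending strips
FFT_LEN = 128    # zero-padded length fed to the FFT


def make_window():
    """Trapezoidal window of equation (1): 8 rising, 80 flat, 8 falling values."""
    n_total = FRAME_LEN + OVERLAP  # 96
    window = []
    for n in range(n_total):
        if n < RAMP:
            window.append((n + 1) / 9.0)
        elif n < n_total - RAMP:
            window.append(1.0)
        else:
            window.append((n_total - n) / 9.0)
    return window


def window_frame(new_samples, carry):
    """Return (padded_frame, next_carry) for one 80-sample input frame.

    `carry` holds the last 16 samples of the previous frame.
    """
    if len(new_samples) != FRAME_LEN or len(carry) != OVERLAP:
        raise ValueError("expected 80 new samples and 16 carried samples")
    samples = list(carry) + list(new_samples)               # 96 samples
    windowed = [x * w for x, w in zip(samples, make_window())]
    padded = windowed + [0.0] * (FFT_LEN - len(windowed))   # append 32 zeros
    next_carry = list(new_samples[-OVERLAP:])
    return padded, next_carry
```

For clarity the sketch multiplies all 96 samples; as noted in the text, only the first and last 8 samples actually need a multiplication.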

After windowing has been carried out in windowing block 10, the spectrum of a speech frame is calculated in block 20 employing the Fast Fourier Transform (FFT). Samples x(0), x(1), ..., x(n); n = 127 (or said 128 samples) in the frame arriving at FFT block 20 are transformed to the frequency domain employing a real FFT, giving frequency domain samples X(0), X(1), ..., X(f); f = 64 (more generally f = (n+1)/2), in which each sample comprises a real component X_r(f) and an imaginary component X_i(f):

$$X(f) = X_r(f) + jX_i(f), \quad f = 0, \ldots, 64 \tag{2}$$

Realizing the Fast Fourier Transform digitally is known to a person skilled in the art. The real and imaginary components obtained from the FFT are squared and added together in pairs in squaring block 50, the output of which is the power spectrum of the speech frame. If the FFT length is 128, the number of power spectrum components obtained is 65, which is obtained by dividing the length of the FFT by two and incrementing the result by 1, in other words the length of FFT/2 + 1. Accordingly, the power spectrum is obtained from squaring block 50 by calculating the sum of the second powers of the real and imaginary components, component by component:

$$P(f) = X_r^2(f) + X_i^2(f), \quad f = 0, \ldots, 64 \tag{3}$$

The function of squaring block 50 can be realized, as presented in figure 8, by taking the real and imaginary components to squaring blocks 51 and 52 (which carry out a simple mathematical squaring, known to be carried out digitally) and by summing the squared components in a summing unit 53. In this way, as the output of squaring block 50, power spectrum components P(0), P(1), ..., P(f); f = 64 are obtained, and they correspond to the powers of the components in the time domain signal at different frequencies as follows (presuming that an 8 kHz sampling frequency is used):

$$P(f),\ f = 0, \ldots, 64, \text{ corresponds to middle frequency } f \cdot 4000/64 \text{ Hz} \tag{4}$$

After this, 8 new power spectrum components, or power spectrum component combinations S(s), s = 0,...,7, are formed in block 60 and they are here called calculation spectrum components. The calculation spectrum components S(s) are formed by always summing 7 adjacent power spectrum components P(f) for each calculation spectrum component S(s) as follows:

$$
\begin{aligned}
S(0) &= P(1) + P(2) + \ldots + P(7) \\
S(1) &= P(8) + P(9) + \ldots + P(14) \\
S(2) &= P(15) + P(16) + \ldots + P(21) \\
S(3) &= P(22) + \ldots + P(28) \\
S(4) &= P(29) + \ldots + P(35) \\
S(5) &= P(36) + \ldots + P(42) \\
S(6) &= P(43) + \ldots + P(49) \\
S(7) &= P(50) + \ldots + P(56)
\end{aligned}
\tag{5}
$$

This can be realized, as presented in figure 9, utilizing counter 61 and summing unit 62, so that the counter 61 always counts up to seven and, controlled by the counter, summing unit 62 always sums seven subsequent components and produces a sum as an output. In this case the lowest combination component S(0) corresponds to middle frequencies [62.5 Hz to 437.5 Hz] and the highest combination component S(7) corresponds to middle frequencies [3125 Hz to 3500 Hz]. The frequencies lower than this (below 62.5 Hz) or higher than this (above 3500 Hz) are not essential for speech and can be ignored.
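The power spectrum of equation (3) and the band grouping of equation (5) can be sketched as below. For self-containment the sketch uses a direct DFT from the standard library instead of a fast transform, which is an assumption made for illustration only; the function names are hypothetical.

```python
import cmath
import math

NUM_BANDS = 8
BINS_PER_BAND = 7


def power_spectrum(frame):
    """P(f) = Xr(f)^2 + Xi(f)^2 for f = 0..len(frame)/2, via a direct DFT."""
    n = len(frame)
    spectrum = []
    for f in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * f * t / n)
        spectrum.append(acc.real ** 2 + acc.imag ** 2)
    return spectrum


def calculation_spectrum(p):
    """Equation (5): S(s) sums bins 7s+1 .. 7s+7; bins 0 and 57..64 are ignored."""
    return [
        sum(p[BINS_PER_BAND * s + 1: BINS_PER_BAND * s + BINS_PER_BAND + 1])
        for s in range(NUM_BANDS)
    ]
```

With a 128-sample frame, `power_spectrum` returns the 65 components P(0)..P(64) and `calculation_spectrum` reduces them to the 8 components S(0)..S(7).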
Instead of using the solution of Figure 3, power spectrum components P(f) can also be calculated from the input frame using a filter bank as presented in figure 4. The filter bank comprises band-pass filters H_j(z) covering the frequency band of interest. The filter bank can be either uniform or composed of variable bandwidth filters. Typically, the filter bank outputs are decimated to improve efficiency. The design and digital implementation of filter banks is known to a person skilled in the art. Sub-band samples z_j(i) in each band j are calculated from the input signal x(n) using filter H_j(z). Signal power at each band can be calculated as follows:

$$S(j) = \sum_{i=0}^{L-1} z_j(i) \cdot z_j(i) \tag{6}$$

where L is the number of samples in the sub-band within one input frame.

When a VAD is used with a speech codec, the calculation spectrum components S(s) can be calculated using Linear Prediction Coefficients (LPC), which are calculated by most of the speech codecs used in digital mobile phone systems. Such an arrangement is presented in figure 5. LPC coefficients are calculated in a speech codec 3 using a technique called linear prediction, where a linear filter is formed. The LPC coefficients of the filter are direct order coefficients d(i), which can be calculated from autocorrelation coefficients ACF(k). As will be shown below, the direct order coefficients d(i) can be used for calculating calculation spectrum components S(s). The autocorrelation coefficients ACF(k), which can be calculated from input frame samples x(n), can be used for calculating the LPC coefficients. If LPC coefficients or ACF(k) coefficients are not available from the speech codec, they can be calculated from the input frame. Autocorrelation coefficients ACF(k) are calculated in the speech codec 3 as follows:
$$ACF(k) = \sum_{i}^{N} x(i)\, x(i-k), \quad k = 0, 1, \ldots, M \tag{7}$$

where

N is the number of samples in the input frame,
M is the LPC order (e.g., 8), and
x(i) are the samples in the input frame.



LPC coefficients d(i), which represent the impulse response of the short-term analysis filter, can be calculated from the autocorrelation coefficients ACF(k) using a previously known method, e.g., the Schur recursion algorithm or the Levinson-Durbin algorithm.
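As a small illustration of equation (7), the autocorrelation coefficients could be computed as below. The sketch assumes that only products of samples lying inside the current frame are summed (the sum starts at i = k); that lower limit and the function name are assumptions, not taken from the specification.

```python
def autocorrelation(frame, order=8):
    """ACF(k) for k = 0..order, summing x(i) * x(i - k) over in-frame samples."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i - k] for i in range(k, n))
        for k in range(order + 1)
    ]
```

ACF(0) returned by this function is the frame energy, which is the quantity used later as the signal power.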

The amplitude at a desired frequency is calculated in block 8 shown in figure 5 from the LPC values using the Fast Fourier Transform (FFT) according to the following equation:

$$A(k) = \sum_{i=0}^{M-1} d(i)\, e^{-j i 2\pi k / K} \tag{8}$$

where

K is a constant, e.g. 8000,
k corresponds to a frequency for which power is calculated (i.e., A(k) corresponds to frequency (k/K)·fs, where fs is the sample frequency), and
M is the order of the short-term analysis.



The amplitude of a desired frequency band can be estimated as follows:

$$A(k_1, k_2) = \sum_{i=0}^{M-1} d(i)\, C(k_1, k_2, i) \tag{9}$$

where k1 is the start index of the frequency band and k2 is the end index of the frequency band.

The coefficients C(k1, k2, i) can be calculated beforehand and saved in a memory (not shown) to reduce the required computation load. These coefficients can be calculated as follows:

$$C(k_1, k_2, i) = \sum_{k=k_1}^{k_2} e^{-j i 2\pi k / K} \tag{10}$$



An approximation of the signal power at calculation spectrum component S(s) can be calculated by inverting the square of the amplitude A(k1, k2) and multiplying by ACF(0). The inversion is needed because the linear predictor coefficients represent the inverse spectrum of the input signal. ACF(0) represents the signal power and it is calculated in equation 7.

$$S(s) = \frac{ACF(0)}{A(k_1, k_2)^2} \tag{11}$$

where each calculation spectrum component S(s) is calculated using specific constants k1 and k2 which define the band limits.

Above, different ways of calculating the power (calculation) spectrum components S(s) have been described.
Further in Fig. 2 the spectrum of noise N(s), s = 0,...,7 is estimated in estimation block 80 (presented in more detail in figure 11) when the voice activity detector does not detect speech. Estimation is carried out in block 80 by calculating recursively a time-averaged mean value for each spectrum component S(s), s = 0,...,7 of the signal brought from block 6:

$$N_n(s) = \lambda(s)\, N_{n-1}(s) + (1 - \lambda(s))\, S(s), \quad s = 0, \ldots, 7 \tag{12}$$

In this context N_{n-1}(s) means a calculated noise spectrum estimate for the previous frame, obtained from memory 83, as presented in figure 11, and N_n(s) means an estimate for the present frame (n = frame order number) according to the equation above. This calculation is preferably carried out digitally in block 81, the inputs of which are the spectrum components S(s) from block 6, the estimate for the previous frame N_{n-1}(s) obtained from memory 83 and the value of the time-constant variable λ(s) calculated in block 82. The updating can be done using a faster time constant when the input spectrum components S(s) are lower than the noise estimate components N_{n-1}(s). The value of the variable λ(s) is determined according to the next table (typical values for λ(s)):



S(s) < N_{n-1}(s)    (V_ind, STcount)    λ(s)
Yes                  (0,0)               0.85
No                   (0,0)               0.9
Yes                  (0,1)               0.85
No                   (0,1)               0.9
Yes                  (1,0)               0.9
No                   (1,0)               1 (no updating)
Yes                  (1,1)               0.9
No                   (1,1)               0.95

The values V_ind and STcount are explained more closely later on.

In the following the symbol N(s) is used for the noise spectrum estimate calculated for the present frame. The calculation according to the above estimation is preferably carried out digitally. Carrying out multiplications, additions and subtractions according to the above equation digitally is well known to a person skilled in the art.
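The recursive noise update of equation (12), with the time constant chosen from the table of typical values, can be sketched as follows. The dictionary layout and the function name are illustrative assumptions; only the numeric values come from the table.

```python
# lambda(s) keyed by (S(s) < N_prev(s), V_ind, STcount), typical values from the table
LAMBDA_TABLE = {
    (True, 0, 0): 0.85, (False, 0, 0): 0.9,
    (True, 0, 1): 0.85, (False, 0, 1): 0.9,
    (True, 1, 0): 0.9,  (False, 1, 0): 1.0,   # 1.0 means no updating
    (True, 1, 1): 0.9,  (False, 1, 1): 0.95,
}


def update_noise(noise_prev, spectrum, v_ind, st_count):
    """Return the new per-band noise estimate N_n(s) from N_{n-1}(s) and S(s)."""
    noise = []
    for n_prev, s in zip(noise_prev, spectrum):
        lam = LAMBDA_TABLE[(s < n_prev, v_ind, st_count)]
        noise.append(lam * n_prev + (1.0 - lam) * s)
    return noise
```

Because the time constant is looked up per band, a band whose power has dropped below the estimate adapts faster than one whose power has risen.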
Further in Fig. 2 a ratio SNR(s), s = 0,...,7 is calculated from the input spectrum S(s) and the noise spectrum N(s), component by component, in calculation block 90, and the ratio is called the signal-to-noise ratio:

$$SNR(s) = \frac{S(s)}{N(s)} \tag{13}$$

The signal-to-noise ratios SNR(s) represent a kind of voice activity decision for each frequency band of the calculation spectrum components. From the signal-to-noise ratios SNR(s) it can be determined whether the frequency band signal contains speech or noise, and accordingly each ratio indicates voice activity. The calculation block 90 is also preferably realized digitally, and it carries out the above division. Carrying out a division digitally is as such known to a person skilled in the art.

In Fig. 2 the relative noise level is calculated in block 70, which is presented in more detail in figure 10, and in which the time-averaged mean value for speech $\hat{S}(n)$ is calculated using the power spectrum estimate S(s), s = 0,...,7. The time-averaged mean value $\hat{S}(n)$ is updated when speech is detected. First the mean value $\bar{S}(n)$ of the power spectrum components in the present frame is calculated in block 71, into which spectrum components S(s) are obtained as an input from block 60, as follows:

$$\bar{S}(n) = \frac{1}{8} \sum_{s=0}^{7} S(s) \tag{14}$$



The time-averaged mean value $\hat{S}(n)$ is obtained by calculating in block 72 (e.g., recursively) based upon the time-averaged mean value $\hat{S}(n-1)$ for the previous frame, which is obtained from memory 78, in which the calculated time-averaged mean value has been stored during the previous frame, the calculation spectrum mean value $\bar{S}(n)$ obtained from block 71, and a time constant α which has been stored in advance in memory 79a:

$$\hat{S}(n) = \alpha\, \hat{S}(n-1) + (1 - \alpha)\, \bar{S}(n) \tag{15}$$

in which n is the order number of a frame and α is said time constant, the value of which is from 0.0 to 1.0, typically between 0.9 and 1.0. In order not to include very weak speech in the time-averaged mean value (e.g. at the end of a sentence), it is updated only if the mean value of the spectrum components for the present frame exceeds a threshold value dependent on the time-averaged mean value. This threshold value is typically one quarter of the time-averaged mean value. The calculation of the two previous equations is preferably executed digitally.

Correspondingly, the time-averaged mean value of noise power $\hat{N}(n)$ is obtained from calculation block 73 by using the power spectrum estimate of noise N(s), s = 0,...,7 and the component mean value $\bar{N}(n)$ calculated from it according to the next equation:

$$\hat{N}(n) = \beta\, \hat{N}(n-1) + (1 - \beta)\, \bar{N}(n) \tag{16}$$

in which β is a time constant, the value of which is from 0.0 to 1.0, typically between 0.9 and 1.0. The noise power time-averaged mean value is updated in each frame. The mean value of the noise spectrum components $\bar{N}(n)$ is calculated in block 76, based upon spectrum components N(s), as follows:



$$\bar{N}(n) = \frac{1}{8} \sum_{s=0}^{7} N(s) \tag{17}$$

and the noise power time-averaged mean value $\hat{N}(n-1)$ for the previous frame is obtained from memory 74, in which it was stored during the previous frame. The relative noise level η is calculated in block 75 as a scaled and maximum-limited quotient of the time-averaged mean values of noise and speech:

$$\eta = \min\left(max\_n,\; \kappa\, \frac{\hat{N}(n)}{\hat{S}(n)}\right) \tag{18}$$

in which κ is a scaling constant (typical value 4.0), which has been stored in advance in memory 77, and max_n is the maximum value of the relative noise level (typically 1.0), which has been stored in memory 79b.
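The relative noise level computation of equations (14) to (18) can be sketched as a small stateful helper. The class name, the default time constants of 0.95 (chosen from the stated typical range 0.9 to 1.0), and the guard that returns max_n when no speech power has been accumulated yet are assumptions of this sketch.

```python
class RelativeNoiseLevel:
    """Tracks time-averaged speech and noise power and returns eta per frame."""

    def __init__(self, speech_mean, noise_mean, alpha=0.95, beta=0.95,
                 kappa=4.0, max_n=1.0):
        self.speech_mean = speech_mean   # time-averaged speech power
        self.noise_mean = noise_mean     # time-averaged noise power
        self.alpha = alpha               # speech time constant (assumed value)
        self.beta = beta                 # noise time constant (assumed value)
        self.kappa = kappa               # scaling constant, typical value 4.0
        self.max_n = max_n               # upper limit of eta, typically 1.0

    def update(self, spectrum, noise, speech_detected):
        s_bar = sum(spectrum) / len(spectrum)   # equation (14)
        n_bar = sum(noise) / len(noise)         # equation (17)
        # Speech average: only on speech frames that are not too weak (one quarter rule).
        if speech_detected and s_bar > 0.25 * self.speech_mean:
            self.speech_mean = (self.alpha * self.speech_mean
                                + (1.0 - self.alpha) * s_bar)
        # Noise average: updated in every frame.
        self.noise_mean = self.beta * self.noise_mean + (1.0 - self.beta) * n_bar
        if self.speech_mean <= 0.0:
            return self.max_n                   # assumption: avoid division by zero
        return min(self.max_n, self.kappa * self.noise_mean / self.speech_mean)
```

A caller would create one instance per channel and call `update` once per frame with the current S(s), N(s) and the latest speech decision.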

For producing a VAD decision in the device in Fig. 2, a distance D_SNR between the input signal and the noise model is calculated in the VAD decision block 110 utilizing the signal-to-noise ratio SNR(s), which by digital calculation realizes the following equation:

$$D_{SNR} = \sum_{s=s\_l}^{s\_h} v_s\, SNR(s) \tag{19}$$

in which s_l and s_h are the index values of the lowest and highest frequency components included and v_s is the component weighting coefficient, which are predetermined and stored in advance in a memory, from which they are retrieved for calculation. Typically, all signal-to-noise estimate value components are used (s_l = 0 and s_h = 7), and they are weighted equally: v_s = 1.0/8.0; s = 0,...,7.

The following is a closer description of the embodiment of a VAD decision block 110, with reference to figure 12. A summing unit 111 in the voice activity detector sums the values of the signal-to-noise ratios SNR(s) obtained from different frequency bands, whereby the parameter D_SNR, describing the spectrum distance between the input signal and the noise model, is obtained according to the above equation (19), and the value D_SNR from the summing unit 111 is compared with a predetermined threshold value vth in comparator unit 112. If the threshold value vth is exceeded, the frame is regarded to contain speech. The summing can also be weighted in such a way that more weight is given to the frequencies at which the signal-to-noise ratio can be expected to be good. The output and decision of the voice activity detector can be presented with a variable V_ind, for the values of which the following conditions are obtained:
$$
V_{ind} =
\begin{cases}
1, & D_{SNR} > vth \\
0, & D_{SNR} \le vth
\end{cases}
\tag{20}
$$
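The per-band signal-to-noise ratios and the resulting decision can be sketched as follows. The sketch assumes equal weights of 1/8 when none are supplied and assumes a strictly positive noise estimate in every band; the function names are hypothetical.

```python
def snr_per_band(spectrum, noise):
    """SNR(s) = S(s) / N(s) for each band; the noise estimate must be positive."""
    return [s / n for s, n in zip(spectrum, noise)]


def vad_decision(snr, vth, weights=None):
    """Return (V_ind, D_SNR): weighted sum of band SNRs compared with vth."""
    if weights is None:
        weights = [1.0 / len(snr)] * len(snr)   # equal weighting, e.g. 1/8
    d_snr = sum(w * r for w, r in zip(weights, snr))
    v_ind = 1 if d_snr > vth else 0
    return v_ind, d_snr
```

Passing a non-uniform `weights` list corresponds to giving more weight to bands where the signal-to-noise ratio is expected to be good.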



Because the VAD controls the updating of the background spectrum estimate N(s), and the latter in turn affects the function of the voice activity detector in the way described above, it is possible that both noise and speech are indicated as speech (V_ind = 1) if the background noise level suddenly increases. This further inhibits the update of the background spectrum estimate N(s). To prevent this, the time (number of frames) during which subsequent frames are regarded not to contain speech is monitored. Subsequent frames which are stationary and are not indicated voiced are assumed not to contain speech.

In block 7 in figure 2, Long Term Prediction (LTP) analysis, which is also called pitch analysis, is calculated. Voiced detection is done using long term predictor parameters. The long term predictor parameters are the lag (i.e. pitch period) and the long term predictor gain. Those parameters are calculated in most speech coders. Thus if a voice activity detector is used beside a speech codec (as described in Fig. 5), those parameters can be obtained from the speech codec.

The long term prediction analysis can be calculated from a number of samples M which equals the frame length N, or the input frame can be divided into sub-frames (e.g. 4 sub-frames, 4*M = N) and the long term parameters calculated separately from each sub-frame. The division of the input frame into these sub-frames is done in the LTP analysis block 7 (Fig. 2). The sub-frame samples are denoted xs(i).

Accordingly, in block 7 first the auto-correlation R(l) of the sub-frame samples xs(i) is calculated:

$$R(l) = \sum_{i}^{M} xs(i) \cdot xs(i-l) \tag{21}$$

where l = Lmin,...,Lmax (e.g. Lmin = 40, Lmax = 160).

The last Lmax samples from the old sub-frames must be saved for the above-mentioned calculation.

Then a maximum value Rmax of R(l) is searched so that Rmax = max(R(l)), where l = 40,...,160. The long term predictor lag LTP_lag(j) is the index l which corresponds to Rmax. Variable j indicates the index of the sub-frame (j = 0,...,3).

LTP_gain can be calculated as follows:

LTP_gain(j) = Rmax/Rtot

where

$$Rtot = \sum_{i=0}^{N} xs(i - LTP\_lag(j))^2 \tag{22}$$



A parameter representing the long term predictor lag gain of a frame (LTP_gain_sum) can be calculated by summing the long term predictor lag gains of the sub-frames (LTP_gain(j)):

$$LTP\_gain\_sum = \sum_{j=0}^{3} LTP\_gain(j) \tag{23}$$

55 

If LTP_gain_sum is higher than a fixed threshold thr_lag, the frame is indicated to be voiced:

if (LTP_gain_sum > thr_lag)
    voiced = 1
else
    voiced = 0
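The voiced detection based on equations (21) to (23) can be sketched as below. The sketch assumes that all sums run over the samples of one sub-frame, that at least Lmax past samples are available in a single history buffer, and that the gain is taken as zero when Rtot is zero; the function names and the buffer layout are illustrative, and no value for thr_lag is implied.

```python
def ltp_lag_and_gain(history, start, length, l_min=40, l_max=160):
    """Return (lag, gain) for the sub-frame history[start:start + length]."""
    if start < l_max:
        raise ValueError("need at least l_max past samples before the sub-frame")
    best_lag, r_max = l_min, float("-inf")
    for lag in range(l_min, l_max + 1):
        r = sum(history[i] * history[i - lag] for i in range(start, start + length))
        if r > r_max:
            best_lag, r_max = lag, r
    r_tot = sum(history[i - best_lag] ** 2 for i in range(start, start + length))
    gain = r_max / r_tot if r_tot > 0.0 else 0.0   # assumption: silent input gives 0
    return best_lag, gain


def is_voiced(history, frame_start, thr_lag, sub_len=20, n_sub=4):
    """Sum the sub-frame gains and compare the sum with the threshold thr_lag."""
    gain_sum = 0.0
    for j in range(n_sub):
        _, gain = ltp_lag_and_gain(history, frame_start + j * sub_len, sub_len)
        gain_sum += gain
    return 1 if gain_sum > thr_lag else 0
```

For a strongly periodic input each sub-frame gain approaches 1, so the sum approaches the number of sub-frames.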

Further in Fig. 2 an average noise spectrum estimate NA(s) is calculated in block 100 as follows:

$$NA_n(s) = a\, NA_{n-1}(s) + (1 - a)\, S(s), \quad s = 0, \ldots, 7 \tag{24}$$

where a is a time constant of value 0 < a < 1 (e.g. 0.9).

Also a spectrum distance D between the average noise spectrum estimate NA(s) and the spectrum estimate S(s) is calculated in block 100 as follows:

$$D = \sum_{s=0}^{7} \frac{\max(NA(s),\, S(s))}{\min(NA(s),\, S(s),\, Low\_Limit)} \tag{25}$$



Low_Limit is a small constant, which is used to keep the division result small when the noise spectrum or the signal spectrum at some frequency band is low.

If the spectrum distance D is larger than a predetermined threshold Dlim, a stationarity counter stat_cnt is set to zero. If the spectrum distance D is smaller than the threshold Dlim and the signal is not detected voiced (voiced = 0), the stationarity counter is incremented. The following conditions are obtained for the stationarity counter:

if (D > Dlim)
    stat_cnt = 0
if (D < Dlim and voiced = 0)
    stat_cnt = stat_cnt + 1

Block 100 gives an output stat_cnt which is reset to zero when V_ind gets the value 0, to meet the following condition:

if (V_ind = 0)
    stat_cnt = 0

If this number of subsequent frames exceeds a predetermined threshold value max_spf, the value of which is e.g. 50, the value of STcount is set to 1. This provides the following conditions for the output STcount in relation to the counter value stat_cnt:

if (stat_cnt > max_spf)
    STcount = 1
else
    STcount = 0
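The counter logic above can be gathered into one per-frame update, sketched below. The order in which the three conditions are applied within a frame is an assumption of this sketch, as is the function name; d_lim has no default because the text gives no typical value for Dlim.

```python
def update_stationarity(stat_cnt, d, voiced, v_ind, d_lim, max_spf=50):
    """Return (stat_cnt, STcount) after processing one frame."""
    if d > d_lim:
        stat_cnt = 0                      # spectrum changed: not stationary
    elif d < d_lim and voiced == 0:
        stat_cnt += 1                     # stationary and unvoiced frame
    if v_ind == 0:
        stat_cnt = 0                      # frame already classified as noise
    st_count = 1 if stat_cnt > max_spf else 0
    return stat_cnt, st_count
```

With max_spf = 50, STcount becomes 1 on the 51st consecutive stationary, unvoiced frame that the detector has flagged as speech.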

Additionally, in the invention the accuracy of the background spectrum estimate N(s) is enhanced by adjusting said threshold value vth of the voice activity detector utilizing the relative noise level η (which is calculated in block 70). In an environment in which the signal-to-noise ratio is very good (or the relative noise level η is low), the value of the threshold vth is increased based upon the relative noise level η. This reduces the interpretation of rapid changes in background noise as speech. Adaptation of the threshold value vth is carried out in block 113 according to the following:

$$vth_1 = \max(vth\_min_1,\; vth\_fix_1 - vth\_slope_1 \cdot \eta) \tag{26}$$

in which vth_fix1, vth_min1 and vth_slope1 are positive constants, typical values for which are e.g.: vth_fix1 = 2.5; vth_min1 = 2.0; vth_slope1 = 5.0.

In an environment with a high noise level, the threshold is decreased to decrease the probability that speech is detected as noise. The mean value of the noise spectrum components $\bar{N}(n)$ is then used to decrease the threshold vth as follows:

$$vth_2 = \min(vth_1,\; vth\_fix_2 - vth\_slope_2 \cdot \bar{N}(n)) \tag{27}$$

in which vth_fix2 and vth_slope2 are positive constants. Thus if the mean value of the noise spectrum components $\bar{N}(n)$ is large enough, the threshold vth2 is lower than the threshold vth1.
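The two-stage threshold adaptation of equations (26) and (27) can be sketched as follows. The defaults for the first stage follow the typical values quoted in the text (with the slope read as 5.0); vth_fix2 and vth_slope2 are required arguments because the text gives no typical values for them. The function name is hypothetical.

```python
def adapt_threshold(eta, noise_mean, vth_fix2, vth_slope2,
                    vth_fix1=2.5, vth_min1=2.0, vth_slope1=5.0):
    """Return (vth1, vth2) for relative noise level eta and mean noise power."""
    # Stage 1: a low relative noise level (good SNR) gives a higher threshold.
    vth1 = max(vth_min1, vth_fix1 - vth_slope1 * eta)
    # Stage 2: a high absolute noise level lowers the threshold further.
    vth2 = min(vth1, vth_fix2 - vth_slope2 * noise_mean)
    return vth1, vth2
```

The first stage can only move the threshold between vth_min1 and vth_fix1, while the second stage can only lower it.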

The voice activity detector according to the invention can also be enhanced in such a way that the threshold vth2 is further decreased during speech bursts. This enhances the operation, because as speech slowly becomes quieter it could otherwise happen that the end of speech is taken for noise. The additional threshold adaptation can be implemented in the following way (in block 113):

First, D_SNR is limited between the desired maximum (typically 5) and minimum (typically 2) values according to the following conditions:
$$
D =
\begin{cases}
D_{min}, & \text{if } D_{SNR} < D_{min} \\
D_{max}, & \text{if } D_{SNR} > D_{max} \\
D_{SNR}, & \text{otherwise}
\end{cases}
$$


After this a threshold adaptation coefficient ta0 is calculated by

$$ta_0 = th_{max} - \frac{D - D_{min}}{D_{max} - D_{min}}\,(th_{max} - th_{min}) \tag{28}$$

where th_min and th_max are the minimum (typically 0.5) and maximum (typically 1) scaler values, respectively.

The actual scaler for frame n, ta(n), is calculated by smoothing ta0 with a filter with different time constants for increasing and decreasing values. The smoothing may be performed according to the following equations:

if ta0 > ta(n-1)
    ta(n) = λ0·ta(n-1) + (1-λ0)·ta0        (29)
else
    ta(n) = λ1·ta(n-1) + (1-λ1)·ta0

Here λ0 and λ1 are the attack (increase period; typical value 0.9) and release (decrease period; typical value 0.5) time constants. Finally, the scaler ta(n) can be used to scale the threshold vth2 in order to obtain a new VAD threshold value vth, whereby

$$vth = ta(n) \cdot vth_2 \tag{30}$$
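The scaler calculation and smoothing can be sketched as one function. The sketch assumes that the linear mapping of equation (28) runs from th_max at D_min down to th_min at D_max, which matches the stated aim of lowering the threshold during speech bursts; the function and parameter names are illustrative.

```python
def threshold_scaler(d_snr, ta_prev, d_min=2.0, d_max=5.0,
                     th_min=0.5, th_max=1.0, lam_attack=0.9, lam_release=0.5):
    """Return the smoothed scaler ta(n) given D_SNR and the previous ta(n-1)."""
    d = min(max(d_snr, d_min), d_max)                       # limit D_SNR
    ta0 = th_max - (d - d_min) / (d_max - d_min) * (th_max - th_min)
    # Slow time constant when the scaler rises, faster one when it falls.
    lam = lam_attack if ta0 > ta_prev else lam_release
    return lam * ta_prev + (1.0 - lam) * ta0
```

The final threshold is then obtained as `vth = threshold_scaler(...) * vth2`, so a strong speech burst lowers the threshold quickly and it recovers slowly afterwards.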

A frequently occurring problem in a voice activity detector is that at the very beginning of speech the speech is not detected immediately, and the end of speech is also not detected correctly. This in turn causes the background noise estimate N(s) to get an incorrect value, which again affects later results of the voice activity detector. This problem can be eliminated by updating the background noise estimate using a delay. In this case a certain number N (e.g. N = 2) of power spectra (here calculation spectra) S_1(s),...,S_N(s) of the last frames are stored (e.g. in a buffer implemented at the input of block 80, not shown in figure 11) before updating the background noise estimate N(s). If during the last double amount of frames (i.e. during 2N frames) the voice activity detector has not detected speech, the background noise estimate N(s) is updated with the oldest power spectrum S_1(s) in memory; in any other case updating is not done. This ensures that the N frames before and after the frame used for updating have been noise.
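The delayed update can be sketched with two short buffers, one for the last N spectra and one for the last 2N decisions. The class name and the exact buffer alignment are assumptions of this sketch; the text only states that N spectra are stored and that 2N speech-free frames are required.

```python
from collections import deque


class DelayedNoiseUpdater:
    """Releases the oldest stored spectrum only after 2N speech-free frames."""

    def __init__(self, n_delay=2):
        self.spectra = deque(maxlen=n_delay)         # last N calculation spectra
        self.decisions = deque(maxlen=2 * n_delay)   # last 2N VAD decisions

    def push(self, spectrum, v_ind):
        """Store one frame; return the spectrum to update N(s) with, or None."""
        self.spectra.append(list(spectrum))
        self.decisions.append(v_ind)
        history_full = len(self.decisions) == self.decisions.maxlen
        if history_full and not any(self.decisions):
            return self.spectra[0]                   # oldest stored spectrum
        return None
```

The returned spectrum, when not None, would be passed to the noise update of equation (12) in place of the current frame's spectrum.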

The method according to the invention and the device for voice activity detection are particularly suitable for use in communication devices such as a mobile station or a mobile communication system (e.g. in a base station), and they are not limited to any particular architecture (TDMA, CDMA, digital/analog). Figure 13 presents a mobile station according to the invention, in which voice activity detection according to the invention is employed. The speech signal to be transmitted, coming from a microphone 1, is sampled in an A/D converter 2 and speech coded in a speech codec 3, after which base frequency signal processing (e.g. channel encoding, interleaving), mixing and modulation into radio frequency, and transmission are performed in block TX. The voice activity detector 4 (VAD) can be used for controlling discontinuous transmission by controlling block TX according to the output V_ind of the VAD. If the mobile station includes an echo and/or noise canceller ENC, the VAD 4 according to the invention can also be used in controlling block ENC. From block TX the signal is transmitted through a duplex filter DPLX and an antenna ANT. The known operations of a reception branch RX are carried out for speech received at reception, and it is reproduced through loudspeaker 9. The VAD 4 could also be used for controlling any reception branch RX operations, e.g. in relation to echo cancellation.

Here the realization and embodiments of the invention have been presented by examples of the method and the
device. It is evident to a person skilled in the art that the invention is not limited to the details of the presented
embodiments and that the invention can also be realized in another form without deviating from the characteristics of
the invention. The presented embodiments should only be regarded as illustrating, not limiting. Thus the possibilities
to realize and use the invention are limited only by the enclosed claims. Hereby different alternatives for implementing
the invention defined by the claims, including equivalent realizations, are included in the scope of the invention.


Claims 

1. A voice activity detection device comprising

means for detecting voice activity in an input signal (x(n)), and

means for making a voice activity decision (V_IND) on basis of the detection, characterized in that it comprises

means (6) for dividing said input signal (x(n)) in subsignals (S(s)) representing specific frequency bands,
means (80) for estimating noise (N(s)) in the subsignals,
means (90) for calculating subdecision signals (SNR(s)) on basis of the noise in the subsignals, and
means (110) for making a voice activity decision (V_IND) for the input signal on basis of the subdecision
signals.

2. A voice activity detection device according to claim 1, characterized in that it comprises means (90) for calculating
a signal-to-noise ratio (SNR) for each subsignal and for providing said signal-to-noise ratios as subdecision signals
(SNR(s)).

3. A voice activity detection device according to claim 2, characterized in that the means (110) for making a voice
activity decision (V_IND) for the input signal comprises

means (111) for creating a value (D_SNR) based on said signal-to-noise ratios (SNR(s)), and

means (112) for comparing said value (D_SNR) with a threshold value (vth) and for outputting a voice activity
decision signal (V_IND) on basis of said comparison.

4. A voice activity detection device according to claim 1, characterized in that it comprises means (70) for
determining the mean level of a noise component and a speech component (N, S) contained in the input signal, and means
(113) for adjusting said threshold value (vth) based upon the mean level of the noise component and the speech
component (N, S).

5. A voice activity detection device according to claim 2, characterized in that it comprises means (113) for adjusting
said threshold value (vth) based upon past signal-to-noise ratios (SNR(s)).

6. A voice activity detection device according to claim 2, characterized in that it comprises means (80) for storing the
value of the estimated noise (N(s)) and said noise (N(s)) is updated with past subsignals (S(s)) depending on past
and present signal-to-noise ratios (SNR(s)).

7. A voice activity detection device according to claim 1, characterized in that it comprises means (3) for calculating
linear prediction coefficients based on the input signal (x(n)), and means (8) for calculating said subsignals (S(s))
based on said linear prediction coefficients.

8. A voice activity detection device according to claim 1, characterized in that it comprises

means (7) for calculating a long term prediction analysis producing long term predictor parameters, said
parameters including long term predictor gain (LTP_gain_sum),

means (7) for comparing said long term predictor gain with a threshold value (thr_lag), and
means for producing a voiced detection decision on basis of said comparison.

9. A mobile station for transmission and reception of speech messages, comprising

means for detecting voice activity in a speech message (x(n)), and

means for making a voice activity decision (V_IND) on basis of the detection, characterized in that it comprises

means (6) for dividing said speech message (x(n)) in subsignals (S(s)) representing specific frequency
bands,

means (80) for estimating noise (N(s)) in the subsignals,

means (90) for calculating subdecision signals (SNR(s)) on basis of the noise in the subsignals, and
means (110) for making a voice activity decision (V_IND) for the input signal on basis of the subdecision
signals.

10. A method of detecting voice activity in a communication device, the method comprising the steps of:

receiving an input signal (x(n)),

detecting voice activity in the input signal, and

making (110) a voice activity decision (V_IND) on basis of the detection, characterized in that it comprises
dividing (6) said input signal in subsignals (S(s)) representing specific frequency bands,
estimating noise (N(s)) in the subsignals,

calculating (90) subdecision signals (SNR(s)) on basis of the noise in the subsignals, and
making (110) a voice activity decision (V_IND) for the input signal on basis of the subdecision signals.